LLM benchmarks: run weekly LLM benchmarks from website-managed models by bradleyshep · Pull Request #5324 · clockworklabs/SpacetimeDB

bradleyshep · 2026-06-15T14:05:12Z

Note 1: this requires a website PR to merge

Note 2:

I was able to run all workflow smoke tests successfully, including golden validation and dry-run benchmarks, except for the C# dry-run benchmark path. C# golden validation passes, but the C# benchmark dry run still fails on the runner (intermittently at first, now consistently) despite several attempts to align its build/publish setup with the known-good smoketest path.

gh workflow run llm-benchmark-periodic.yml `
  --repo ClockworkLabs/SpacetimeDB `
  --ref bradley/fix-validate-goldens-ci `
  -f model_set=explicit `
  -f models="openrouter:openai/gpt-5.4-mini" `
  -f languages=rust,csharp,typescript `
  -f modes=guidelines `
  -f tasks=t_000_empty_reducers `
  -f dry_run=true

Description of Changes

This updates the LLM benchmark automation and runner plumbing.

- Move periodic LLM benchmark and golden validation workflows from daily/nightly to weekly Monday UTC runs.
- Add manual workflow inputs for benchmark smoke runs:
  - model set: website-managed, local defaults, or explicit models
  - languages, modes, categories, tasks
  - dry-run mode
- Build the local TypeScript SDK before TypeScript benchmark/golden validation runs.
- Add support for fetching active/available benchmark models from the website API via `--model-source remote`.
- Keep explicit `--models ...` working for manual/local overrides.
- Add OpenRouter preflight checks before benchmark execution:
  - checks key/account credits when available
  - probes the selected model when credit balance cannot be checked
  - supports `OPENROUTER_ALLOW_UNCHECKED_CREDITS=1` escape hatch
  - supports `OPENROUTER_MIN_CREDITS` / `LLM_MIN_CREDITS`
- Force scheduled benchmark workflow runs through OpenRouter with `LLM_VENDOR=openrouter`, while preserving direct OpenAI support for local/manual use.
- Improve benchmark publishing isolation:
  - isolated SpacetimeDB CLI root per publish
  - serialized C# benchmark publish concurrency
  - local NuGet package references for generated C# benchmark projects
  - Windows/PATH handling for TypeScript pnpm
- Update default benchmark model routes to current model names/ids.
- Update TypeScript golden answers for current SDK shape.
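
For readers trying to picture the OpenRouter preflight behavior listed above, here is a minimal sketch of the decision logic. It is not the code in this PR: the `Preflight` enum, `decide_preflight`, and `min_credits_from` are hypothetical names, and both the interaction between the escape hatch and the model probe and the precedence of `OPENROUTER_MIN_CREDITS` over `LLM_MIN_CREDITS` are assumptions.

```rust
/// Outcome of the preflight decision (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Preflight {
    /// Credit balance was readable and meets the minimum.
    Proceed,
    /// Balance could not be read, so probe the selected model instead.
    ProbeModel,
    /// Balance could not be read, but the escape hatch skips further checks.
    ProceedUnchecked,
    /// Balance was readable and is below the minimum.
    Abort { have: f64, need: f64 },
}

/// Decide what to do before a benchmark run.
/// `credits` is `None` when the balance cannot be checked.
/// `allow_unchecked` models OPENROUTER_ALLOW_UNCHECKED_CREDITS=1 (assumed semantics).
pub fn decide_preflight(credits: Option<f64>, min_credits: f64, allow_unchecked: bool) -> Preflight {
    match credits {
        Some(have) if have >= min_credits => Preflight::Proceed,
        Some(have) => Preflight::Abort { have, need: min_credits },
        None if allow_unchecked => Preflight::ProceedUnchecked,
        None => Preflight::ProbeModel,
    }
}

/// Resolve the minimum credit threshold from the two environment variables
/// named in the PR. The lookup order is an assumption; `lookup` is injected
/// so the function can be exercised without touching the real environment.
pub fn min_credits_from(lookup: impl Fn(&str) -> Option<String>, default: f64) -> f64 {
    ["OPENROUTER_MIN_CREDITS", "LLM_MIN_CREDITS"]
        .iter()
        .find_map(|name| lookup(*name)?.trim().parse::<f64>().ok())
        .unwrap_or(default)
}
```

In real use the lookup would likely be `|name| std::env::var(name).ok()`. Keeping the decision a pure function makes the four outcomes easy to unit test without network access.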

API and ABI breaking changes

None.

This adds benchmark-runner/workflow behavior and CLI options, but does not change SpacetimeDB runtime API or ABI.

Expected complexity level and risk

3/5

The changes are mostly isolated to the LLM benchmark runner and GitHub workflows, but the risk is moderate because they touch CI execution paths, local SDK build assumptions, website-managed model resolution, OpenRouter routing, and generated module publish behavior across Rust, C#, and TypeScript.

The most sensitive pieces are:

- GitHub Actions workflow dispatch/manual input behavior.
- Remote model registry parsing from the website.
- C# benchmark publish behavior on the self-hosted runner.
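
To make the model-resolution risk concrete, the sketch below shows one way a `vendor:model` route such as `openrouter:openai/gpt-5.4-mini` could be parsed and how registry entries might be filtered. The `ModelRoute` and `RegistryEntry` types and their fields are hypothetical; the actual website API response shape is not shown in this PR, and treating "active/available" as "active and available" is an assumption.

```rust
/// A parsed `vendor:model` route (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct ModelRoute {
    pub vendor: String,
    pub model: String,
}

/// Split a route on the first ':' so that model ids containing '/' or
/// further ':' characters stay intact. Returns None for malformed input.
pub fn parse_route(spec: &str) -> Option<ModelRoute> {
    let (vendor, model) = spec.trim().split_once(':')?;
    if vendor.is_empty() || model.is_empty() {
        return None;
    }
    Some(ModelRoute {
        vendor: vendor.to_string(),
        model: model.to_string(),
    })
}

/// One entry of the remote registry. The field names are assumptions.
pub struct RegistryEntry {
    pub route: String,
    pub active: bool,
    pub available: bool,
}

/// Keep only entries that are both active and available and that parse cleanly.
pub fn active_available_routes(entries: &[RegistryEntry]) -> Vec<ModelRoute> {
    entries
        .iter()
        .filter(|e| e.active && e.available)
        .filter_map(|e| parse_route(&e.route))
        .collect()
}
```

Silently dropping malformed routes keeps a scheduled run alive when the registry contains a bad entry. A stricter variant could return an error instead so the problem surfaces in CI.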

Testing

cargo check -p xtask-llm-benchmark --bin llm_benchmark
cargo test -p xtask-llm-benchmark --bin llm_benchmark
cargo test -p xtask-llm-benchmark parses_active_available_model_routes
Manual GitHub Actions golden validation smoke runs for Rust, C#, and TypeScript.
Run a dry-run periodic benchmark workflow from this branch with one explicit OpenRouter model, one task, and all languages.
Run a website-dispatched dry-run benchmark and verify it sends model_set=explicit plus selected model/task inputs.

# Description of Changes

Sets `DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1` only on the benchmark harness command that publishes generated C# modules. This keeps dotnet startup out of localized DateTime/TimeZoneInfo formatting on the CI runner, which was crashing before generated C# module publish could run.

Stacked on #5324.

```bash
gh workflow run llm-benchmark-periodic.yml \
  --repo ClockworkLabs/SpacetimeDB \
  --ref bot/debug-llm-csharp-publish \
  -f model_set=explicit \
  -f models="openrouter:openai/gpt-5.4-mini" \
  -f languages=rust,csharp,typescript \
  -f modes=guidelines \
  -f tasks=t_000_empty_reducers \
  -f dry_run=true
```

# API and ABI breaking changes

None.

# Expected complexity level and risk

1. CI benchmark harness environment fix.

# Testing

- [x] `cargo fmt --all`
- [x] `cargo check --manifest-path tools/xtask-llm-benchmark/Cargo.toml`
- [x] `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/llm-benchmark-periodic.yml"); YAML.load_file(".github/workflows/llm-benchmark-validate-goldens.yml")'`
- [x] `git diff --check`

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
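
As an illustration of scoping the environment variable to a single child process rather than the whole harness, the following sketch uses `std::process::Command::env`. The `publish_command` helper and its arguments are hypothetical and do not reproduce the harness code; only the variable name and value come from the description above.

```rust
use std::process::Command;

/// Build a command for publishing a generated C# module, with the
/// globalization-invariant flag set only on this child process.
/// `program` and `args` are placeholders supplied by the caller.
pub fn publish_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args);
    // Affects only the spawned process; the parent environment and any
    // other commands built by the harness are left untouched.
    cmd.env("DOTNET_SYSTEM_GLOBALIZATION_INVARIANT", "1");
    cmd
}
```

Setting the variable per command keeps the workaround narrow: Rust and TypeScript publishes, and any other dotnet invocation, continue to run with the runner's default globalization settings.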

cloutiertyler · 2026-06-16T14:50:22Z

-    false
+fn signal_killed_by(_status: &std::process::ExitStatus) -> Option<i32> {
+    None
 }


This whole transient thing is a little sus, but it's not a regression, so it's fine.
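
For context on the diff above, a platform-split helper of this shape is commonly written as follows. This is a sketch under the assumption that the `None`-returning function in the diff is the non-Unix fallback; the Unix branch shown here is not taken from the PR.

```rust
use std::process::ExitStatus;

/// On Unix, report the signal that terminated the process, if any.
#[cfg(unix)]
pub fn signal_killed_by(status: &ExitStatus) -> Option<i32> {
    use std::os::unix::process::ExitStatusExt;
    status.signal()
}

/// On other platforms there is no signal to report.
#[cfg(not(unix))]
pub fn signal_killed_by(_status: &ExitStatus) -> Option<i32> {
    None
}
```

A caller could then treat a status with `Some(signal)` as a candidate transient failure and retry, while a normal non-zero exit code is reported as a real error.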

cloutiertyler · 2026-06-16T14:52:02Z

+
 /// Context limits for models accessed via OpenRouter.
 /// Uses the same limits as direct clients where known,
 /// falls back to a conservative default.


Seems a little weird to have these in the code, rather than pulling them from OpenRouter, but it's not a regression so I'll let it go.

Agreed. Initially I was using different providers, which had different context limits that were not reachable via API. I think the cleaner long-term approach is to use only OpenRouter and get the context limits from there, and probably to drop the direct OpenAI/other providers entirely.
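
The pattern under discussion, a table of known limits with a conservative fallback, can be sketched as below. The function name, the table contents, and the default value are all placeholders and are not the limits used in the PR.

```rust
/// Conservative fallback used when a model is not in the table.
/// The value is a placeholder, not the one in the PR.
pub const DEFAULT_CONTEXT_LIMIT: usize = 32_000;

/// Look up a model's context limit in a caller-supplied table,
/// falling back to the conservative default for unknown models.
pub fn context_limit(model: &str, known: &[(&str, usize)]) -> usize {
    known
        .iter()
        .find(|(name, _)| *name == model)
        .map(|(_, limit)| *limit)
        .unwrap_or(DEFAULT_CONTEXT_LIMIT)
}
```

If the limits were later fetched from OpenRouter, the same function could be fed a table built from the API response, with the hardcoded default kept for models the response does not cover.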

cloutiertyler

Seems generally fine and is low risk, so once we get CI to pass, I'm good to merge.

bradleyshep added 25 commits June 10, 2026 15:05

updates

a6382ca

Update provider.rs

711ff88

updates

e82f0ae

preflight credit checks; workflow update to use web

bcdb41d

weekly goldens; workflow refinements

f2179a2

Update publishers.rs

8d1d27e

golden fixes

d5957f2

Merge branch 'master' into bradley/fix-validate-goldens-ci

f1ae445

fixes

4c679e2

Update publishers.rs

4358ed5

updates

890be18

Update publishers.rs

480cedf

fixes

d4999e2

Update publishers.rs

e58523f

fixes

032afd1

Merge branch 'master' into bradley/fix-validate-goldens-ci

9eee265

match smoketest (fingers crossed?)

6037418

fix

2e6e02f

shrug

b2308b1

fix?

2b133b8

testing

7857671

test

ee38f7a

Update llm-benchmark-periodic.yml

77e2924

revert tests

63a9c34

preflight no error; vendor to openrouter in periodic

9596077

bradleyshep requested review from bfops, cloutiertyler and jdetter as code owners June 15, 2026 14:05

bradleyshep added 2 commits June 15, 2026 10:11

lints

65e4539

Merge branch 'master' into bradley/fix-validate-goldens-ci

4272cfd

clockwork-labs-bot mentioned this pull request Jun 15, 2026

Avoid .NET globalization crash in LLM benchmarks #5335

Merged

3 tasks

cloutiertyler reviewed Jun 16, 2026


cloutiertyler approved these changes Jun 16, 2026


bradleyshep wants to merge 28 commits into master from bradley/fix-validate-goldens-ci
